1 Introduction


In many fields of science, including environmental, biological and ecological studies,
measurement of the variable of interest is usually time-consuming and/or costly.
Under simple random sampling (SRS), this leads to small samples, for
which statistical inference is less reliable. Thus, if the situation permits, a more
efficient sampling technique may be desirable. A proper choice in such situations is
ranked set sampling (RSS), which was originally introduced by McIntyre (1952) in the
context of an agricultural experiment. The application of RSS in many other areas
of science has been studied by many researchers, including Kvam (2003), who applied RSS
to environmental monitoring, Halls and Dell (1966), who mentioned the application
of RSS in forestry, and Chen et al. (2005), who applied RSS in medicine.

RSS is highly applicable when a few units in a set are easily ranked without full
measurement. One judgment ranking method is to rank by an auxiliary variable.
The auxiliary variable is often highly correlated with the variable
of interest, and its measurement is usually inexpensive and fast. For example, in a
clinical trial, measuring body fat percentage is time-consuming
and expensive, whereas measuring the body circumference (waist), which
is highly correlated with body fat percentage, is inexpensive and fast. This
auxiliary variable may be used to extract a ranked set sample of the values of the
variable of interest. In general, judgment ranking may be based on any means not involving
formal measurement, e.g. visual ranking or a concomitant variable.

Many authors have introduced extensions of RSS to construct improved estimators
of different population attributes. Ozturk (2011) proposed some variations on
RSS in which rankers are permitted to declare ties. Multistage ranked set sampling
(MRSS), proposed by Al-Saleh and Al-Omari (2002), is another example of such a
design. The sampling methodology of MRSS with r stages is described in the following
algorithm. Balanced RSS is simply the case r = 1 of this algorithm.

1. Randomly identify $k^{r+1}$ ($r \ge 1$) units from the target population and allocate
them randomly into $k^{r-1}$ sets, each of size $k^2$.

2. For each set in step 1, regard the $k^2$ units as $k$ samples of size $k$, apply judgement
ordering, by any cheap method not involving formal measurement, to the elements of the
$i$th ($i = 1,\dots,k$) sample and identify the $i$th (judgement) smallest unit, to get a
(judgement) ranked set of size $k$. This step gives $k^{r-1}$ (judgement) ranked sets,
each of size $k$.

3. Repeat step 2 $r - 1$ times, applying each ranking stage to the ranked sets of
its previous stage. A ranked set of size $k$ is acquired in the $(r-1)$th stage.

4. Actually measure the $k$ identified units in step 3.

5. Repeat steps 1-4 $m \ge 1$ times, to obtain a sample of size $n = mk$.

We denote the sample collected using MRSS by $\{X_{[i]j} : i = 1,\dots,k;\ j = 1,\dots,m\}$,
where $X_{[i]j}$ is the $i$th judgement order statistic from the $j$th cycle. The special case
r = 2 of MRSS (Al-Saleh and Al-Kadiri, 2000) is known as double ranked set sampling
(DRSS).
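
The sampling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `mrss_cycle` and `mrss_sample` are hypothetical, the population is a finite array, and judgement ranking is replaced by exact ranking on one column (`rank_col`), i.e. perfect ranking on a concomitant.

```python
import numpy as np


def mrss_cycle(units, k, r, rank_col=0):
    """Reduce k**(r+1) units to one r-stage ranked set of size k.

    units: array of shape (k**(r+1), p). Ranking uses column `rank_col`
    (an assumption standing in for any cheap judgement-ranking method).
    """
    units = np.asarray(units, dtype=float)
    if units.ndim != 2 or units.shape[0] != k ** (r + 1):
        raise ValueError("expected an array of shape (k**(r+1), p)")
    p = units.shape[1]
    diag = np.arange(k)
    for _ in range(r):
        # Group the current units into sets of k samples, each of size k.
        sets = units.reshape(-1, k, k, p)
        order = np.argsort(sets[..., rank_col], axis=2)
        # Position of the i-th smallest unit inside the i-th sample.
        pick = order[:, diag, diag]
        g = np.arange(sets.shape[0])[:, None]
        units = sets[g, diag[None, :], pick].reshape(-1, p)
    return units  # shape (k, p); row i holds the (i+1)-th judgement order statistic


def mrss_sample(population, k, r, m, rng, rank_col=0):
    """Draw m independent cycles; returns an array of shape (m, k, p)."""
    population = np.asarray(population, dtype=float)
    cycles = []
    for _ in range(m):
        idx = rng.choice(len(population), size=k ** (r + 1), replace=False)
        cycles.append(mrss_cycle(population[idx], k, r, rank_col))
    return np.stack(cycles)
```

With `r=1` the function reduces to balanced RSS and with `r=2` to DRSS; only the `k` units returned per cycle would need to be measured on the variable of interest.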

The ranked set sample contains not only measurements on the quantified units
but also additional information on their ranks. Since RSS provides more structured
samples than SRS, improved inference may result from the use of an RSS design.

Investigation of RSS in nonparametric settings has attracted the attention of many
researchers. Dell and Clutter (1972) formally showed that the sample mean using
RSS is an unbiased estimator of the population mean regardless of ranking errors
and that it has a smaller variance than the sample mean using SRS when the number
of measured units is the same. Stokes (1980) considered estimation of the variance
and showed that improved estimates of the variance can be produced in RSS. Stokes
and Sager (1988) characterized a ranked set sample as a sample from a conditional
distribution, conditioning on a multinomial random vector, and applied RSS to
the estimation of the cumulative distribution function. Chen (1999) studied the
kernel method of density estimation in RSS. Deshpande et al. (2006) obtained
nonparametric confidence intervals for quantiles of the population based on a ranked
set sample.

Since the concept of differential entropy was introduced in Shannon (1968)'s paper,
it has been of great interest because of its wide applications, especially in signal
analysis and the analysis of neural data, and also because of its interesting theoretical
properties. The concept of mutual information, defined based on Shannon entropy,
is also well known, especially as a well-defined dependence measure (see Cover and
Thomas, 2012 for more details). The mutual information has several advantages
compared to the classical dependence measures, such as the Pearson, Spearman and
Kendall correlations. The classical dependence criteria usually measure the correlation
between two variables. The Pearson correlation only measures
linear dependence, and the Spearman and Kendall correlations use only the
ranks of the observations. The mutual information enables us to measure both
linear and nonlinear dependence between two or more variables.

Joe (1989) studied the large sample properties of the kernel-based estimator of the
entropy based on SRS. He also obtained the optimal bandwidth and a proper kernel
function. In this paper, we are especially interested in estimating the entropy based
on RSS and MRSS by adapting the work of Joe (1989). The proposed estimator
is applied to estimate the mutual information of variables as well as the
Kullback-Leibler divergence measure. We compare the proposed kernel-based estimator of the
entropy based on RSS and MRSS samples with that of Joe (1989) by approximating
the relative efficiency of the proposed estimator with respect to its competitor in SRS.
We observe that the RSS and MRSS methods improve the kernel-based
estimation of the entropy.

The outline of this paper is as follows. Section 2 introduces a nonparametric
estimator of the entropy in RSS. The approximated bias and mean square error
(MSE) of the proposed estimator are derived. The choice of kernel and the optimal
bandwidth are discussed, and estimation of the MSE of the entropy estimator and
approximation of the efficiency of the estimator relative to the corresponding estimator
in SRS are studied. The applications of the proposed estimator to estimation of
the mutual information and the Kullback-Leibler divergence are discussed in
Section 3. A simulation study and two real data examples are presented in Section
4 to examine the performance of the proposed estimators and to illustrate the applicability
of the theoretical results.


2 The proposed estimator 

Let $f_X$ be a $p$-variate ($p \ge 1$) probability density function (pdf) with respect to
Lebesgue measure $\mu$, with distribution function $F_X$. We consider estimation of

$$H(X) = -\int_S f_X \log f_X \, d\mu,$$

using a ranked set sample, where $S$ is a bounded set such that $f_X$ is bounded below
on it by a positive constant and $\int_S f_X \log f_X \, d\mu \approx \int_{\mathbb{R}^p} f_X \log f_X \, d\mu$.

We assume that $f_X$ has continuous first and second derivatives, which are dominated
by integrable functions.

For each $i$, $X_{[i]1},\dots,X_{[i]m}$ are independent and identically distributed, with pdf
$f_{X_{[i]}}$ and distribution function $F_{X_{[i]}}$.

For $p = 1$ and the ordinary RSS ($r = 1$), when the ranking is perfect, $X_{[i]j}$ is
identically distributed to the $i$th order statistic in a sample of size $k$ from the
parent distribution $F$, while $X_{[1]j},\dots,X_{[k]j}$ are independent. For $p = 2$, if the values
of $X = (X^{(1)}, X^{(2)})$ are ranked by one of its components, then the distribution of $X_{[i]j}$
is identical with the joint distribution of the $i$th order statistic and its concomitant in
an iid sample of size $k$ from $f_X$. For the perfect ranking and DRSS case ($r = 2$), the
distribution of $X_{[i]j}$ is identical with that of the $i$th order statistic and its concomitants
of the independent but not identically distributed sample $X_{[1]j},\dots,X_{[k]j}$.

We assume throughout that the ranked set sample $X_{[i]j}$, $i = 1,\dots,k$, $j = 1,\dots,m$,
is consistent, that is

(C1) For each $x$, $\frac{1}{k}\sum_{i=1}^{k} F_{X_{[i]}}(x) = F_X(x)$.

It is well known that if the ranking of X is performed using one of its components, 
then both RSS and MRSS samples are consistent (see e.g., Al-Saleh and Al-Omari 
(2002) for a proof of the consistency for the DRSS case). 

For any $t$ in the support of $X$, the kernel density estimate of $f_X(t)$ based on the
ranked set sample $X_{[i]j}$, $i = 1,\dots,k$, $j = 1,\dots,m$, is (Chen, 1999)

$$f_n^{RSS}(t) = \frac{1}{mk\gamma^p}\sum_{i=1}^{k}\sum_{j=1}^{m} K_p\!\left(\frac{t - X_{[i]j}}{\gamma}\right),$$

where $K_p$ is a kernel function and $\gamma$ is a positive bandwidth. We assume throughout
that $K_p(u_1,\dots,u_p) = \prod_{j=1}^{p} k_0(u_j)$ and that $k_0$ is a univariate kernel function satisfying
the following conditions:

(C2) $k_0(u) = k_0(-u)$; (C3) $\int k_0(u)\,du = 1$ and (C4) $\int u^2 k_0(u)\,du = 1$.
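
As a minimal sketch of the density estimate above, the following Python function evaluates $f_n^{RSS}$ at a set of points. It assumes that the measured units of all cycles are pooled into one array of shape $(n, p)$ and that $k_0$ is the Gaussian kernel $k_0(u) = (4\pi)^{-0.5}\exp\{-u^2/4\}$ used later in the paper; the names `k0` and `kde_rss` are illustrative only.

```python
import numpy as np


def k0(u):
    """Univariate Gaussian kernel (4*pi)**-0.5 * exp(-u**2 / 4)."""
    return np.exp(-np.asarray(u, dtype=float) ** 2 / 4.0) / np.sqrt(4.0 * np.pi)


def kde_rss(t, sample, gamma):
    """Product-kernel density estimate at the rows of t.

    sample: pooled measured units, shape (n, p) with n = m * k.
    t: evaluation points, shape (q, p). gamma: positive bandwidth.
    """
    sample = np.asarray(sample, dtype=float)
    sample = sample.reshape(len(sample), -1)
    n, p = sample.shape
    t = np.asarray(t, dtype=float).reshape(-1, p)
    # u[a, b, :] = (t_a - X_b) / gamma for every evaluation point and unit.
    u = (t[:, None, :] - sample[None, :, :]) / gamma
    return k0(u).prod(axis=2).sum(axis=1) / (n * gamma ** p)
```

The estimate has the same form as an SRS kernel estimate; the ranked set structure enters only through the way the sample is collected.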

Let

$$F_n^{RSS}(t) = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} I(X_{[i]j} \le t)$$

be the estimator of $F_X$ in RSS; then

$$f_n^{RSS}(t) = \int \gamma^{-p} K_p\!\left(\frac{t - u}{\gamma}\right) dF_n^{RSS}(u).$$

We propose the estimator of the entropy of $X$, $H(X)$, in RSS based on $X_{[i]j}$,
$i = 1,\dots,k$, $j = 1,\dots,m$, as

$$H_n^{RSS}(X) = -\frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \log f_n^{RSS}(X_{[i]j})\, I_S(X_{[i]j}) = -\int_S \log f_n^{RSS}(x)\, dF_n^{RSS}(x).$$

The asymptotic properties of $H_n^{SRS}(X)$ are studied by Joe (1989). In the next
section, we study the asymptotic properties of $H_n^{RSS}(X)$.
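
The entropy estimator can be sketched as a resubstitution average of $-\log f_n^{RSS}$ over the measured units. The code below is an illustration under assumptions: the sample is pooled into an $(n, p)$ array, the Gaussian kernel $(4\pi)^{-0.5}\exp\{-u^2/4\}$ is used for $k_0$, and the indicator $I_S$ is supplied by the caller through the hypothetical argument `in_S` (all units are treated as lying in $S$ when it is omitted).

```python
import numpy as np


def k0(u):
    """Univariate Gaussian kernel (4*pi)**-0.5 * exp(-u**2 / 4)."""
    return np.exp(-np.asarray(u, dtype=float) ** 2 / 4.0) / np.sqrt(4.0 * np.pi)


def entropy_rss(sample, gamma, in_S=None):
    """Kernel entropy estimate: minus the mean of log f_n(X) * I_S(X).

    sample: pooled measured units, shape (n, p); gamma: bandwidth;
    in_S: optional 0/1 array of length n marking units inside S.
    """
    sample = np.asarray(sample, dtype=float)
    sample = sample.reshape(len(sample), -1)
    n, p = sample.shape
    u = (sample[:, None, :] - sample[None, :, :]) / gamma
    # Density estimate evaluated at each measured unit (self term included).
    f = k0(u).prod(axis=2).sum(axis=1) / (n * gamma ** p)
    ind = np.ones(n) if in_S is None else np.asarray(in_S, dtype=float)
    return -np.mean(np.log(f) * ind)
```

For a sample stored by cycles with shape `(m, k, p)`, one would call `entropy_rss(cycles.reshape(-1, p), gamma)`.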


2.1 Characteristics of the estimator 

In this section, we approximate the MSE and bias of $H_n^{RSS}(X)$. With a suitable choice
of bandwidth $\gamma$ (see Section 2.2), we deduce that $MSE(H_n^{RSS}(X))$ is $O(n^{-1})$, that
is, $H_n^{RSS}(X)$ is a consistent estimator of $H(X)$. To prove the results, we need the
following lemma, which is an analog of Lemma 2.1 of Joe (1989).

Lemma 1 Let $U_n(x) = \sqrt{n}\,(F_n^{RSS}(x) - F_X(x))$; then, under the condition C1,

(i) for any integrable function $a(x)$ with $E(a(X)) < \infty$, we have $E\int a(x)\, dU_n(x) = 0$;

(ii) for any integrable function $a(x,y)$, we have

$$E\int\!\!\int a(x,y)\, dU_n(x)\, dU_n(y) = \int a(x,x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\int\!\!\int a(x,y)\, f_{X_{[i]}}(x) f_{X_{[i]}}(y)\, dx\, dy.$$

Proof See the Appendix.

By using Lemma 1, we get the following result.

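Theorem 1 (i) The bias of $H_n^{RSS}(X)$ is

$$E(H_n^{RSS}(X)) - H(X) = (H_\gamma - H(X)) + n^{-1}\alpha_1 + O(n^{-3/2}),$$

where $H_\gamma = -E(I_S(X)\log(A_\gamma(X)))$, $A_\gamma(x) = E\big(\gamma^{-p}K_p((x - X)/\gamma)\big)$, $I_S$ denotes the
indicator function of $S$,

$$\alpha_1 = \int_S B(x,x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\int_S\!\int_S B(x,y)\, f_{X_{[i]}}(x) f_{X_{[i]}}(y)\, dx\, dy,$$

and

$$B(x,y) = -\frac{1}{2}\int_S \frac{\gamma^{-2p} K_p\!\left(\frac{z-x}{\gamma}\right) K_p\!\left(\frac{z-y}{\gamma}\right)}{(A_\gamma(z))^2}\, dF_X(z) + \frac{\gamma^{-p} K_p\!\left(\frac{x-y}{\gamma}\right)}{A_\gamma(x)}\, I_S(x). \tag{1}$$

(ii) The MSE of $H_n^{RSS}(X)$ is

$$E(H_n^{RSS}(X) - H(X))^2 = (H_\gamma - H(X))^2 + n^{-1}(\alpha_2 - 2\alpha_1(H_\gamma - H(X))) + O(n^{-3/2}), \tag{2}$$

where

$$\alpha_2 = \int_S A^2(x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\left(\int_S A(x)\, dF_{X_{[i]}}(x)\right)^2,$$

and

$$A(x) = \int_S \frac{\gamma^{-p} K_p\!\left(\frac{y-x}{\gamma}\right)}{A_\gamma(y)}\, dF_X(y) + \log(A_\gamma(x))\, I_S(x). \tag{3}$$

Proof See the Appendix.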
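

2.2 Choosing the optimal bandwidth and kernel

To choose an optimal value for the bandwidth and a suitable kernel function, note
that

$$\begin{aligned}
\alpha_1 &= \theta + \gamma^{-p}K_p(0)\int_S A_\gamma(x)^{-1}\, dF_X(x) - \tfrac{1}{2}\gamma^{-p}\kappa_2\int_S A^*_\gamma(z) A_\gamma(z)^{-2}\, dF_X(z) \\
&= \theta - \gamma^{-p}\left(\tfrac{\kappa_2}{2} - K_p(0)\right)\int_S A_\gamma(x)^{-1}\, dF_X(x) - \tfrac{1}{2}\gamma^{-p}\kappa_2\int_S (A^*_\gamma(z) - A_\gamma(z)) A_\gamma(z)^{-2}\, dF_X(z),
\end{aligned}$$

where

$$\theta = -\frac{1}{k}\sum_{i=1}^{k}\int_S\!\int_S B(x,y)\, f_{X_{[i]}}(x) f_{X_{[i]}}(y)\, dx\, dy,$$

$A^*_\gamma(z) = E\big(I_S(X)\gamma^{-p}K_p^2((z - X)/\gamma)\big)/\kappa_2$ and $\kappa_2 = \int K_p^2(u)\, du$.

The term $\theta$ is $O(1)$. In addition,

$$A^*_\gamma(z) - A_\gamma(z) = \tfrac{1}{2}\gamma^2\, \mathrm{tr} f_X''(z)\left[\int \frac{v^2 k_0^2(v)}{k_{02}}\, dv - 1\right] + O(\gamma^2),$$

where $k_{02} = \int k_0^2(u)\, du$.

Thus, if

$$k_0(0) = \frac{k_{02}}{2^{1/p}} \tag{4}$$

and

$$\int \frac{u^2 k_0^2(u)}{k_{02}}\, du = 1, \tag{5}$$

then we have

$$\alpha_1 = O(1) + O(\gamma^{2-p}).$$

Joe (1989) showed that the kernel function of the form

$$k_0(u) = \begin{cases} \eta_1 + \eta_2 |u|, & |u| \le \xi_1, \\ \eta_3 - \eta_4 |u|, & \xi_1 < |u| \le \xi_2, \end{cases} \tag{6}$$

satisfies conditions (4) and (5), and computed $\eta_i$, $\xi_j$, $i = 1,\dots,4$, $j = 1,2$, for
$p = 1,\dots,4$.

For $p \le 2$, a competitor of the kernel function in (6) would be

$$k_0(u) = (4\pi)^{-0.5}\exp\{-u^2/4\}, \tag{7}$$

which satisfies condition (5).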
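
A quick numerical check of the Gaussian kernel $k_0(u) = (4\pi)^{-0.5}\exp\{-u^2/4\}$ can make the role of the two kernel conditions concrete. The sketch below assumes the conditions read $k_0(0) = k_{02}/2^{1/p}$ and $\int u^2 k_0^2(u)/k_{02}\,du = 1$ with $k_{02} = \int k_0^2(u)\,du$; the integrals are approximated by a plain Riemann sum on a wide grid, and all variable names are illustrative.

```python
import numpy as np

# Wide, fine grid; the integrands decay like exp(-u**2 / 4) or faster.
u = np.linspace(-40.0, 40.0, 800001)
du = u[1] - u[0]
k0 = np.exp(-u ** 2 / 4.0) / np.sqrt(4.0 * np.pi)

mass = np.sum(k0) * du                        # integral of k0
k02 = np.sum(k0 ** 2) * du                    # integral of k0 squared
second = np.sum(u ** 2 * k0 ** 2) * du / k02  # left-hand side of the moment condition

k0_at_zero = 1.0 / np.sqrt(4.0 * np.pi)
# Gap between k0(0) and k02 / 2**(1/p); zero would mean the first condition holds.
gap = {p: k0_at_zero - k02 / 2.0 ** (1.0 / p) for p in (1, 2)}

print(mass, k02, second, gap)
```

Under these assumptions the moment condition holds (the ratio $k_0^2/k_{02}$ is the standard normal density), while $k_0(0)$ differs from $k_{02}/2^{1/p}$ for $p = 1, 2$, which is in line with the weaker order of $\alpha_1$ obtained for this kernel.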

Using (7), we have, for $p \le 2$,

$$\alpha_1 = O(1) + O(\gamma^{-p}).$$

On the other hand, since $H_\gamma - H(X) = O(\gamma^2)$, for the kernel function in (6), we get

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-1}\gamma^{2}) + O(n^{-1}\gamma^{4-p}) + O(\gamma^4)$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-1}) + O(n^{-1}\gamma^{2-p}) + O(\gamma^2),$$

and for the kernel function in (7), we get

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-1}\gamma^{2}) + O(n^{-1}\gamma^{2-p}) + O(\gamma^4)$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-1}) + O(n^{-1}\gamma^{-p}) + O(\gamma^2).$$

Using the kernel function in (6), for $p \le 2$, by setting $\gamma = O(n^{-1/(2+0.5p)})$, we
deduce that

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-4/(2+0.5p)})$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-2/(2+0.5p)}).$$

Using the kernel function in (7), for $p \le 2$, by setting $\gamma = O(n^{-1/(2+0.5p)})$, we have

$$E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1}) + O(n^{-(4-0.5p)/(2+0.5p)})$$

and

$$E(H_n^{RSS}(X) - H(X)) = O(n^{-(2-0.5p)/(2+0.5p)}).$$

Thus, using both kernels in (6) and (7), for $p \le 2$, $E(H_n^{RSS}(X) - H(X))^2 = O(n^{-1})$.
Although the rate of convergence of the bias is somewhat higher using
the kernel in (6), simulations show that, for $p \le 2$, using the kernel function in (7)
results in better performance of the entropy estimator than using the kernel function
in (6), especially for smaller sample sizes.

From the above discussion, for both kernel functions in (6) and (7) and for $p \le 2$,
the optimal bandwidth is of the form

$$\gamma = c\, n^{-1/(2+0.5p)},$$

where $c$ depends on the parameters of $f_X$.

To discuss a suitable value of $c$, we first study, through simulation, the behavior of the
optimal value of $\gamma$ obtained by minimizing the cross-validation
criterion

$$CV_\gamma = \frac{1}{m}\sum_{j=1}^{m}\left(H_n^{RSS}(X) - H_{n-k}^{RSS}(X^{(-j)})\right)^2, \tag{8}$$

where $X^{(-j)} = \{X_{[i]l} : i = 1,\dots,k,\ l\,(\ne j) = 1,\dots,m\}$. Simulations show that the
bandwidth should decrease with more dependence for a bivariate density. Therefore,
if $\Sigma$ is the covariance matrix of the population, then $c$ increases by increasing
$\vartheta = |\Sigma|$. For the multivariate normal distribution with covariance matrix $\Sigma$ (with
$|\Sigma|$ not too close to zero), a rough approximation of $\vartheta^{1/(2+0.5p)}$ is

$$\frac{0.5 - \int_R \phi(x,\Sigma)\, dx}{0.5 - 0.5^p},$$

where $R = [-0.674, 0.674]^p$ (see Jones, 1989). Thus, we take the optimal bandwidth
to be

$$\gamma = d_1\, n^{-1/(2+0.5p)}\, IQR\, \frac{0.5 - \alpha}{0.5 - 0.5^p},$$

where $d_1$ is a suitable normalizing constant, which is decreasing in $\rho$, $\alpha$ is the
proportion of data in the rectangle $\times_{j=1}^{p}[q_{j1}, q_{j3}]$, $q_{j1}$ and $q_{j3}$ are the lower and upper
quartiles of the $j$th component, and $IQR$ is the average of the interquartile ranges.

Simulations suggest that for $p \le 2$, the rule

$$\gamma = d_1\, n^{-1/(2+0.5p)}\, IQR\, \frac{0.5 - \alpha}{0.5 - 0.5^p} \tag{9}$$

works quite well, with the kernel in (7), for $\rho = 0.5(0.1)0.9$, for the values of $d_1$
given in Table 1 for different values of $k = 3, 5$, $p = 1, 2$ and $r = 1, 2$. The value
of $d_1$ corresponding to the minimum simulated MSE of the
estimator $H_n^{RSS}(X)$ on a grid of values $d_1 = 0.5(0.05)2.00$ is chosen as the optimal
value of $d_1$.


Remark 1 Simulations suggest that for $p = 3, 4$, the bandwidth in (9) works quite
well with the kernel in (6) under the multivariate normal distribution. We have
not provided the suitable values of $d_1$ for this case here for the sake of brevity.
However, an illustrative example is provided for the case $p = 3$ in Section 4.
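
The leave-one-cycle-out cross-validation idea behind the bandwidth study can be sketched as follows. The code is a hedged illustration, not the authors' procedure: it assumes the sample is stored by cycles with shape $(m, k, p)$, that the same bandwidth is used for the full and the reduced samples, and that $k_0$ is the Gaussian kernel $(4\pi)^{-0.5}\exp\{-u^2/4\}$. The helper names `entropy_rss`, `cv_criterion` and `rate_bandwidth` are hypothetical, and `rate_bandwidth` only implements the rate $c\,n^{-1/(2+0.5p)}$, not the data-driven IQR factor.

```python
import numpy as np


def k0(u):
    """Univariate Gaussian kernel (4*pi)**-0.5 * exp(-u**2 / 4)."""
    return np.exp(-np.asarray(u, dtype=float) ** 2 / 4.0) / np.sqrt(4.0 * np.pi)


def entropy_rss(sample, gamma):
    """Kernel entropy estimate from a pooled sample of shape (n, p)."""
    sample = np.asarray(sample, dtype=float)
    n, p = sample.shape
    u = (sample[:, None, :] - sample[None, :, :]) / gamma
    f = k0(u).prod(axis=2).sum(axis=1) / (n * gamma ** p)
    return -np.mean(np.log(f))


def cv_criterion(cycles, gamma):
    """Mean squared change in the estimate when one cycle is left out."""
    cycles = np.asarray(cycles, dtype=float)
    m, k, p = cycles.shape
    h_full = entropy_rss(cycles.reshape(-1, p), gamma)
    h_loo = np.array([
        entropy_rss(np.delete(cycles, j, axis=0).reshape(-1, p), gamma)
        for j in range(m)
    ])
    return np.mean((h_full - h_loo) ** 2)


def rate_bandwidth(n, p, c=1.0):
    """Bandwidth of the form c * n**(-1 / (2 + 0.5 * p))."""
    return c * n ** (-1.0 / (2.0 + 0.5 * p))
```

A grid search such as `min(grid, key=lambda c: cv_criterion(cycles, rate_bandwidth(n, p, c)))` would then mimic the selection of the constant on a grid of candidate values.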


2.3 Estimation of the MSE of the estimator 

Computations show that $CV_\gamma$ in (8) has a large negative bias for estimating the MSE
of $H_n^{RSS}(X)$. To propose a less biased estimator of the MSE of $H_n^{RSS}(X)$, we use the
approximated MSE in Theorem 1, where $CV_\gamma$ is employed as an estimator of the
term $(H_\gamma - H(X))^2$. Since $H_\gamma - H(X)$ is negative for suitable values of $\gamma$, we use
$-|D_\gamma|$ as an estimator of $H_\gamma - H(X)$, where

$$D_\gamma = \frac{1}{m}\sum_{j=1}^{m}\left(H_n^{RSS}(X) - H_{n-k}^{RSS}(X^{(-j)})\right).$$

Thus, our proposed estimator of the MSE of $H_n^{RSS}(X)$ is

$$M_\gamma = CV_\gamma + n^{-1}(\hat\alpha_2 + 2\hat\alpha_1|D_\gamma|)\,\delta\big(CV_\gamma + n^{-1}(\hat\alpha_2 + 2\hat\alpha_1|D_\gamma|)\big), \tag{10}$$

Table 1: The suitable choices of $d_1$ in (9) for the estimator $H_n^{RSS}(X)$ under the
bivariate normal distribution, for $n = 30$, $k = 3, 5$, $p = 1, 2$, $r = 1, 2$ and
$\rho = 0.5(0.1)0.9$.

                          ρ
r    k    p     0.5     0.6     0.7     0.8     0.9
1    3    1     1.45    1.40    1.30    1.25    1.20
          2     1.45    1.45    1.30    1.30    1.05
     5    1     1.45    1.40    1.30    1.25    1.20
          2     1.50    1.45    1.45    1.30    1.05
2    3    1     1.50    1.45    1.45    1.40    1.20
          2     1.45    1.45    1.40    1.30    1.05
     5    1     1.50    1.45    1.45    1.40    1.30
          2     1.50    1.45    1.40    1.20    1.05


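where

$$\delta(t) = \begin{cases} 1, & t \ge 0, \\ 0, & t < 0, \end{cases}$$

$$\hat\alpha_1 = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \hat B(X_{[i]j}, X_{[i]j})\, I_S(X_{[i]j}) - \frac{1}{m(m-1)k}\sum_{i=1}^{k}\sum_{j \ne j'} \hat B(X_{[i]j}, X_{[i]j'}),$$

$$\hat\alpha_2 = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \hat A^2(X_{[i]j})\, I_S(X_{[i]j}) - \frac{1}{k}\sum_{i=1}^{k}\left(\frac{1}{m}\sum_{j=1}^{m} \hat A(X_{[i]j})\right)^2,$$

$$\hat B(x,y) = -\frac{1}{2mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \frac{\gamma^{-2p} K_p\!\left(\frac{X_{[i]j}-x}{\gamma}\right) K_p\!\left(\frac{X_{[i]j}-y}{\gamma}\right)}{(f_n^{RSS}(X_{[i]j}))^2} + \frac{\gamma^{-p} K_p\!\left(\frac{x-y}{\gamma}\right)}{f_n^{RSS}(x)}\, I_S(x),$$

and

$$\hat A(x) = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \frac{\gamma^{-p} K_p\!\left(\frac{X_{[i]j}-x}{\gamma}\right)}{f_n^{RSS}(X_{[i]j})} + \log(f_n^{RSS}(x))\, I_S(x).$$
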

The performance of the estimators in (8) and (10) is examined through simulation
studies in Section 4.


2.4 The relative efficiency 

To approximate the RE of $H_n^{RSS}(X)$ with respect to $H_n^{SRS}(X)$, we use the approximated
MSE of $H_n^{SRS}(X)$ obtained by Joe (1989) as

$$E(H_n^{SRS}(X) - H(X))^2 = (H_\gamma - H(X))^2 + n^{-1}(\beta_2 - 2\beta_1(H_\gamma - H(X))) + O(n^{-3/2}), \tag{11}$$

where

$$\beta_1 = \int_S B(x,x)\, dF_X(x) - \int_S\!\int_S B(x,y)\, dF_X(x)\, dF_X(y) = \int_S B(x,x)\, dF_X(x) - \frac{1}{2}$$

and

$$\beta_2 = \int_S A^2(x)\, dF_X(x) - \left(\int_S A(x)\, dF_X(x)\right)^2 = \int_S A^2(x)\, dF_X(x) - (1 - H_\gamma)^2,$$

in which $A(x)$ and $B(x,y)$ are given in (3) and (1), respectively. Joe (1989) also
showed that the bias of $H_n^{SRS}(X)$ is

$$E(H_n^{SRS}(X)) - H(X) = (H_\gamma - H(X)) + n^{-1}\beta_1 + O(n^{-3/2}).$$

As a corollary of Theorem 1 and using (11), the RE of $H_n^{RSS}(X)$ relative to
$H_n^{SRS}(X)$ is approximated as

$$RE(H_n^{RSS}(X), H_n^{SRS}(X)) \approx 1 + \frac{n^{-1}(\beta_2 - \alpha_2 - 2(\beta_1 - \alpha_1)(H_\gamma - H(X)))}{(H_\gamma - H(X))^2 + n^{-1}(\alpha_2 - 2\alpha_1(H_\gamma - H(X)))}. \tag{12}$$

Using the Cauchy-Schwarz inequality for series and the consistency of the RSS sample,
we have

$$\frac{1}{k}\sum_{i=1}^{k}\left(\int_S A(x)\, dF_{X_{[i]}}(x)\right)^2 \ge \left(\int_S A(x)\, dF_X(x)\right)^2,$$

and consequently $\beta_2 \ge \alpha_2$.

Numerical computations under the bivariate normal distribution, for different
values of $k$, $m$ and $\rho$, show that for a suitable choice of $\gamma$, the approximated
relative efficiency in (12) is greater than 1. The computed values of the approximated
relative efficiencies for the estimators of $H(X^{(1)})$ are given in Table 2 for different
values of $\rho$, $n$ and $k$, for RSS and DRSS schemes, under the bivariate normal distribution.

Since the values given in Table 2 are population parameters, different optimal
values of $\gamma$ are needed for this computation. We used $\gamma = c\,n^{-1/(2+0.5p)}$, with suitable
values of the normalizing constant $c$. It is found that, for the SRS scheme, suitable
values of $c$ are $c = 1.35$ for $\rho = 0.9$ and $c = 1.30$ for $\rho = 0.8$. For RSS, we used
$c = 1.40$ for $k = 3$ and $\rho = 0.9$, $c = 1.35$ for $k = 3$ and $\rho = 0.8$, $c = 1.65$ for $k = 5$
and $\rho = 0.9$ and $c = 1.60$ for $k = 5$ and $\rho = 0.8$. For DRSS, our suggested values
are $c = 1.45$ for $k = 3$ and $\rho = 0.9$, $c = 1.40$ for $k = 3$ and $\rho = 0.8$, $c = 1.70$ for
$k = 5$ and $\rho = 0.9$ and $c = 1.65$ for $k = 5$ and $\rho = 0.8$.

Table 2: Approximated relative efficiencies, under the bivariate normal distribution, for
different values of $\rho$, $n$ and $k$.

           ρ = 0.9                                    ρ = 0.8
           n = 15        n = 30        n = 45         n = 15        n = 30        n = 45
Scheme     k=3    k=5    k=3    k=5    k=3    k=5     k=3    k=5    k=3    k=5    k=3    k=5
RSS        1.10   1.58   1.05   1.33   1.02   1.24    1.04   1.31   1.02   1.16   1.02   1.21
DRSS       1.68   2.04   1.29   1.44   1.23   1.28    1.32   1.50   1.25   1.32   1.18   1.25


As one can see from Table 2, all approximated values of $RE(H_n^{RSS}(X^{(1)}), H_n^{SRS}(X^{(1)}))$
are greater than 1.00. The relative efficiencies generally decrease as $n$ increases
and as $k$ decreases, and in Table 2 they are never larger for $\rho = 0.8$ than for
$\rho = 0.9$. For DRSS, the relative efficiencies are greater than for RSS,
as expected.

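Finally, using the results of Al-Saleh and Al-Omari (2002) we have, for $i = 1,\dots,k$
and $x$ in the support of $X$,

$$\lim_{r\to\infty} f_{X_{[i]}}(x) = \begin{cases} k f_X(x), & F_X^{-1}\!\left(\frac{i-1}{k}\right) \le x < F_X^{-1}\!\left(\frac{i}{k}\right), \\ 0, & \text{otherwise}, \end{cases} \tag{13}$$

and consequently for each $x, y$ in the support of $X$,

$$\lim_{r\to\infty} \frac{1}{k}\sum_{i=1}^{k} f_{X_{[i]}}(x) f_{X_{[i]}}(y) = k f_X(x) f_X(y).$$

Thus

$$\lim_{r\to\infty} \alpha_1 = \beta_1 - \frac{k-1}{2}, \quad \text{and} \quad \lim_{r\to\infty} \alpha_2 = \beta_2 - (k-1)(1 - H_\gamma)^2,$$

and thereby

$$\lim_{r\to\infty} RE(H_n^{RSS}(X), H_n^{SRS}(X)) \approx 1 + n^{-1}(k-1)\big((1 - H_\gamma)^2 - H_\gamma + H(X)\big)$$
$$\times \left[(H_\gamma - H(X))^2 + n^{-1}\Big(\beta_2 - (k-1)(1 - H_\gamma)^2 - 2\big(\beta_1 - \tfrac{k-1}{2}\big)(H_\gamma - H(X))\Big)\right]^{-1}.$$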

3 Applications 

In addition to estimation of the entropy, the developed results might be applied for 
estimation of the Kullback-Leibler distance measure as well as the mutual informa¬ 
tion criterion. The estimator of the Kullback-Leibler distance measure can be used 
to construct homogeneity test statistics. The mutual information is well-known as a 
measure of dependence. These applications are described in the following sections. 

3.1 Application to estimation of the mutual information

For $p \ge 2$ and $X = (X^{(1)}, X^{(2)})$, the mutual information of $X^{(1)}$ and $X^{(2)}$, defined
as

$$I(X^{(1)}, X^{(2)}) = H(X^{(1)}) + H(X^{(2)}) - H(X) = \int_S f_X \log\left(\frac{f_X}{f_{X^{(1)}} f_{X^{(2)}}}\right) d\mu,$$

which is the Kullback-Leibler divergence between the joint distribution of
$X = (X^{(1)}, X^{(2)})$ and the product of its marginal pdfs, is used as a popular and
well-defined nonlinear correlation criterion.

If, for example, $X$ is distributed as the bivariate normal distribution with correlation
parameter $\rho$, the mutual information between $X^{(1)}$ and $X^{(2)}$ is

$$I(X^{(1)}, X^{(2)}) = -0.5\log(1 - \rho^2).$$

Motivated by the above relationship, and using the fact that $I(X^{(1)}, X^{(2)}) \ge 0$, a
standardized nonlinear correlation measure based on the mutual information might
be defined as follows

$$\lambda(X^{(1)}, X^{(2)}) = 1 - e^{-2I(X^{(1)}, X^{(2)})} \in (0,1). \tag{14}$$
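
A short worked check shows why the standardization $1 - e^{-2I}$ is natural: for the bivariate normal distribution, substituting $I = -0.5\log(1-\rho^2)$ gives exactly $\rho^2$. The function names below are illustrative only.

```python
import math


def mi_bivariate_normal(rho):
    """Mutual information of a bivariate normal with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)


def standardized_mi(mi):
    """Standardized dependence measure 1 - exp(-2 * I)."""
    return 1.0 - math.exp(-2.0 * mi)


for rho in (0.5, 0.6, 0.7, 0.8, 0.9):
    mi = mi_bivariate_normal(rho)
    # The standardized measure recovers the squared correlation.
    print(rho, round(mi, 4), round(standardized_mi(mi), 4))
```

In the normal case the measure is therefore on the same scale as a squared correlation coefficient, which helps when reading the estimated values reported in Section 4.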


Using the proposed estimator of the entropy, the mutual information $I(X^{(1)}, X^{(2)})$
is estimated by

$$\hat I(X^{(1)}, X^{(2)}) = H_n^{RSS}(X^{(1)}) + H_n^{RSS}(X^{(2)}) - H_n^{RSS}(X) = \int_S \log\left(\frac{f_n^{X;RSS}(x_1,x_2)}{f_n^{X^{(1)};RSS}(x_1)\, f_n^{X^{(2)};RSS}(x_2)}\right) dF_n^{X;RSS}(x_1,x_2), \tag{15}$$

where $H_n^{RSS}(X^{(l)})$ is the estimator of $H(X^{(l)})$ in RSS based on $X^{(l)}_{[i]j}$, $i = 1,\dots,k$,
$j = 1,\dots,m$, $l = 1,2$, and $H_n^{RSS}(X)$ is the estimator of $H(X)$ based
on $(X^{(1)}_{[i]j}, X^{(2)}_{[i]j})$, $i = 1,\dots,k$, $j = 1,\dots,m$, that is

$$H_n^{RSS}(X) = -\frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \log f_n^{X;RSS}(X^{(1)}_{[i]j}, X^{(2)}_{[i]j})\, I_S(X^{(1)}_{[i]j}, X^{(2)}_{[i]j}).$$

Using the kernel function $K_p(u_1,\dots,u_p) = \prod_{j=1}^{p} k_0(u_j)$ and a similar bandwidth
parameter for estimation of $f_n^{X;RSS}(x_1,x_2)$, $f_n^{X^{(1)};RSS}(x_1)$ and $f_n^{X^{(2)};RSS}(x_2)$, it is
straightforward that $\hat I(X^{(1)}, X^{(2)}) \ge 0$. The corresponding estimator of $\lambda(X^{(1)}, X^{(2)})$
is obtained by plugging the estimator of $I(X^{(1)}, X^{(2)})$ into (14).
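
The plug-in estimator of the mutual information can be sketched by combining three entropy estimates that share one bandwidth. The code below is a hedged illustration: it assumes a pooled bivariate sample of shape $(n, 2)$, the Gaussian kernel $(4\pi)^{-0.5}\exp\{-u^2/4\}$ for $k_0$, and that every unit lies in $S$; the names `entropy_rss` and `mutual_information_rss` are hypothetical.

```python
import numpy as np


def k0(u):
    """Univariate Gaussian kernel (4*pi)**-0.5 * exp(-u**2 / 4)."""
    return np.exp(-np.asarray(u, dtype=float) ** 2 / 4.0) / np.sqrt(4.0 * np.pi)


def entropy_rss(sample, gamma):
    """Kernel entropy estimate from a pooled sample of shape (n, p)."""
    sample = np.asarray(sample, dtype=float)
    n, p = sample.shape
    u = (sample[:, None, :] - sample[None, :, :]) / gamma
    f = k0(u).prod(axis=2).sum(axis=1) / (n * gamma ** p)
    return -np.mean(np.log(f))


def mutual_information_rss(sample, gamma):
    """Return (I_hat, lambda_hat) for a pooled bivariate sample (n, 2)."""
    sample = np.asarray(sample, dtype=float)
    h1 = entropy_rss(sample[:, [0]], gamma)  # marginal entropy of X^(1)
    h2 = entropy_rss(sample[:, [1]], gamma)  # marginal entropy of X^(2)
    h12 = entropy_rss(sample, gamma)         # joint entropy
    i_hat = h1 + h2 - h12
    return i_hat, 1.0 - np.exp(-2.0 * i_hat)
```

Using one common bandwidth for the joint and marginal estimates follows the text; a practical implementation might tune the constant $d_1$ separately for this estimator.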

Table 3: The suitable choices of $d_1$ in (9) for the estimator in (15) under the bivariate
normal distribution, for $n = 30$, $p = 2$ and different values of $k$, $r$ and $\rho$.

                     ρ
r    k     0.5     0.6     0.7     0.8     0.9
1    3     1.55    1.40    1.30    1.00    0.70
     5     1.50    1.30    1.30    1.00    0.70
2    3     1.50    1.40    1.10    1.00    0.70
     5     1.50    1.40    1.30    1.10    1.00


Simulations show that the bandwidth parameter in (9) works well for the estimator
in (15), with $p$ equal to the dimension of $X = (X^{(1)}, X^{(2)})$, by choosing some suitable
value of the constant $d_1$. Table 3 presents the suitable choices of $d_1$ in (9), for the
estimator in (15), under the bivariate normal distribution, for $n = 30$, $p = 2$ and
different values of $k$, $r$ and $\rho = 0.5(0.1)0.9$. The value of $d_1$ corresponding
to the minimum simulated MSE of the estimator in (15), on a grid of
values $d_1 = 0.5(0.05)2.00$, is chosen as the optimal value of $d_1$.


3.2 Application to estimation of the Kullback-Leibler divergence

Using the proposed estimator of the entropy, the Kullback-Leibler divergence between
$f_{X^{(1)}}$ and $f_{X^{(2)}}$, defined as

$$KL(f_{X^{(1)}}, f_{X^{(2)}}) = \int_S f_{X^{(1)}} \log\left(\frac{f_{X^{(1)}}}{f_{X^{(2)}}}\right) d\mu, \tag{16}$$

can be estimated as

$$\widehat{KL}(f_{X^{(1)}}, f_{X^{(2)}}) = \int_S \log\left(\frac{f_n^{X^{(1)};RSS}(x)}{f_n^{X^{(2)};RSS}(x)}\right) dF_n^{X^{(1)};RSS}(x) = \frac{1}{mk}\sum_{i=1}^{k}\sum_{j=1}^{m} \log\left(\frac{f_n^{X^{(1)};RSS}(X^{(1)}_{[i]j})}{f_n^{X^{(2)};RSS}(X^{(1)}_{[i]j})}\right). \tag{17}$$

Statistical properties of the estimator in (17), as well as its application to the
homogeneity test, remain an open problem.
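
A minimal sketch of this plug-in Kullback-Leibler estimator averages the log ratio of two kernel density estimates over the units of the first sample. It assumes two pooled samples of shapes $(n_1, p)$ and $(n_2, p)$, a common bandwidth, the Gaussian kernel $(4\pi)^{-0.5}\exp\{-u^2/4\}$ for $k_0$, and that all units lie in $S$; the names `kde` and `kl_rss` are hypothetical.

```python
import numpy as np


def k0(u):
    """Univariate Gaussian kernel (4*pi)**-0.5 * exp(-u**2 / 4)."""
    return np.exp(-np.asarray(u, dtype=float) ** 2 / 4.0) / np.sqrt(4.0 * np.pi)


def kde(t, sample, gamma):
    """Product-kernel density estimate at rows of t from sample (n, p)."""
    t = np.asarray(t, dtype=float)
    sample = np.asarray(sample, dtype=float)
    n, p = sample.shape
    u = (t[:, None, :] - sample[None, :, :]) / gamma
    return k0(u).prod(axis=2).sum(axis=1) / (n * gamma ** p)


def kl_rss(sample1, sample2, gamma):
    """Average of log(f1_hat / f2_hat) over the units of sample1."""
    f1 = kde(sample1, sample1, gamma)  # density of X^(1) at its own units
    f2 = kde(sample1, sample2, gamma)  # density of X^(2) at the same units
    return np.mean(np.log(f1 / f2))
```

The estimate is zero when both arguments are the same sample and grows as the second sample moves away from the first, which is the behaviour a homogeneity statistic would exploit.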


4 Numerical studies 

In this section, several numerical studies are conducted to examine the performance
of the proposed estimators and to illustrate the theoretical results. A simulation study
under the bivariate normal distribution is conducted to examine the performance of
the estimators in (8) and (10). A data analysis is performed to examine the performance
of the estimators for a real data example. Finally, the proposed theoretical
results are applied to the problem of variable selection based on the estimator of the
mutual information as a correlation measure.

4.1 Simulation study 

In order to compare the performance of $CV_\gamma$ in (8) and $M_\gamma$ in (10), two separate
simulation studies were conducted, each with 10,000 iterations. Since computation
of the multiple integrals in $\alpha_i$ and $\beta_i$, $i = 1,2$, takes a lot of time, we have considered
the simple case $p = 2$, $p_1 = p_2 = 1$ and estimation of $H(X^{(1)})$, under the bivariate
normal distribution with correlation parameter $\rho$, using the kernel in (7) and the
bandwidth in (9). The variable $X^{(2)}$ is considered as the ranker variable and the
computations are performed for RSS and DRSS schemes. In the first simulation
study, the values of $MSE(H_n^{RSS}(X^{(1)}))$ were computed; they are given in Table
4 for different values of $n$, $k$ and $\rho$, for RSS and DRSS schemes. In the second
simulation study, the variance and expectation of $CV_\gamma$ and $M_\gamma$ were computed; they
are given next to each of the simulated $MSE(H_n^{RSS}(X^{(1)}))$ in Table 4.

As one can see from Table 4, the estimator $M_\gamma$ is less biased than $CV_\gamma$, except
for the case $k = 5$ and $m = 3$. As expected, the variance of $CV_\gamma$ is less than that of
$M_\gamma$. In the simulation study, it is found that the suitable values of $d_1$ for $H_n^{RSS}(X^{(1)})$
are almost the same as those for $CV_\gamma$ and $M_\gamma$.

4.2 Data analysis 

We apply the results to a data set analyzed first by Platt et al. (1988). The original
data set consists of measurements made on 399 long-leaf pine (Pinus palustris)
trees. We use a truncated version of this data set, given in Chen et al. (2004),
which contains the diameter (in centimeters) at breast height ($X^{(1)}$) and the height
(in feet) ($X^{(2)}$) of 396 trees.

To examine the distributional properties of the proposed estimators for this real
data set, we extract 10,000 samples of size $n = mk = 30$, with $k = 3$ and $m = 10$,
using RSS, as well as samples of the same size using DRSS and SRS from this data
set, with the diameter of the trees as the ranking variable. Since the population size is
finite, the sampling is with replacement. Also, since the population does not have a
density and the differential entropy is not defined, the true entropy of the population
is estimated by

$$H_N(X) = -\frac{1}{N}\sum_{i=1}^{N} \log f_N(x_i)\, I_S(x_i),$$

Table 4: Simulated MSEs as well as simulated variance and expectation of the estimated MSEs, under the bivariate normal
distribution, for different values of $\rho$, $n$ and $k$.

               RSS                                                    DRSS
ρ     n    k   Simul. MSE  Var(M_γ)  E(M_γ)   Var(CV_γ)  E(CV_γ)     Simul. MSE  Var(M_γ)  E(M_γ)   Var(CV_γ)  E(CV_γ)
0.9   15   3   0.0447      0.0293    0.0661   0.0001     0.0178      0.0405      0.0113    0.0455   0.0001     0.0175
           5   0.0425      0.0039    0.0732   0.0008     0.0623      0.0422      0.0031    0.0683   0.0007     0.0612
      30   3   0.0198      0.0038    0.0161   0.0001     0.0037      0.0172      0.0012    0.0080   0.0001     0.0036
           5   0.0160      0.0002    0.0117   0.0001     0.0103      0.0151      0.0002    0.0115   0.0001     0.0104
      45   3   0.0131      0.0005    0.0041   0.0001     0.0016      0.0114      0.0003    0.0024   0.0001     0.0015
           5   0.0099      0.0001    0.0040   0.0001     0.0039      0.0088      0.0001    0.0041   0.0001     0.0041
0.8   15   3   0.0468      0.0439    0.0619   0.0001     0.0177      0.0429      0.0099    0.0441   0.0001     0.0177
           5   0.0446      0.0074    0.0814   0.0008     0.0442      0.0601      0.0018    0.0711   0.0008     0.0658
      30   3   0.0208      0.0077    0.0266   0.0001     0.0037      0.0184      0.0025    0.0103   0.0001     0.0036
           5   0.0171      0.0007    0.0145   0.0001     0.0102      0.0162      0.0001    0.0113   0.0001     0.0106
      45   3   0.0137      0.0019    0.0088   0.0001     0.0015      0.0122      0.0002    0.0024   0.0001     0.0016
           5   0.0106      0.0001    0.0052   0.0001     0.0039      0.0097      0.0001    0.0041   0.0001     0.0041

Table 5: The simulated biases and MSEs of the entropy and the mutual information
estimators for the tree data.

Estimator   H^RSS(X^(1))          H^DRSS(X^(1))          H^SRS(X^(1))
Bias        -0.389                -0.417                 0.010
MSE         0.280                 0.248                  0.604

Estimator   I^RSS(X^(1),X^(2))    I^DRSS(X^(1),X^(2))    I^SRS(X^(1),X^(2))
Bias        1.111                 1.069                  1.953
MSE         1.377                 1.210                  4.809

Estimator   λ^RSS(X^(1),X^(2))    λ^DRSS(X^(1),X^(2))    λ^SRS(X^(1),X^(2))
Bias        0.291                 0.169                  0.316
MSE         0.085                 0.084                  0.100


where $f_N(x) = \frac{1}{N\gamma^p}\sum_{i=1}^{N} K_p\!\left(\frac{x - x_i}{\gamma}\right)$, and similarly for $I_N(X^{(1)}, X^{(2)})$. Using the optimal
bandwidth and kernel function for a simple random sample of size $N = 396$, the full
population estimates of the true parameters are $H_N(X^{(1)}) = 4.921$, $I_N(X^{(1)}, X^{(2)}) = 0.550$
and $\lambda_N(X^{(1)}, X^{(2)}) = 0.667$.

The simulated bias and MSE of the estimators were computed and are given
in Table 5. As can be observed from Table 5, the estimators of $\lambda_N(X^{(1)}, X^{(2)})$
and $I_N(X^{(1)}, X^{(2)})$ based on DRSS have the smallest MSEs and biases. Also, the
estimators of $H_N(X^{(1)})$ have smaller MSEs but are more biased under the RSS and DRSS
schemes.

4.3 Application to a real data set 

In this section, we apply the proposed theoretical results to the problem of variable
selection in regression estimation based on ranked set sampling (Yu and Lam,
1997). We consider a real data set, consisting of measurements of different body fat
percentage evaluations and various body circumference measurements for $N = 252$
men. This data set is taken from Penrose et al. (1985). Consider the following
variables

$Y$: Percent of body fat using Brozek's equation;

$X^{(1)}$: Abdomen circumference;

$X^{(2)}$: Weight;

$X^{(3)}$: Chest circumference.

Suppose that the regression estimator of the mean percent of body fat using Brozek's
equation is of interest. Three auxiliary variables $X^{(1)}$, $X^{(2)}$ and $X^{(3)}$ are considered,
and suppose that one wishes to choose the two auxiliary variables most correlated
with the variable of interest, $Y$. The classical correlation criteria fail to measure the
correlation between three variables. Thus, the mutual information criterion is used
as the variable selection tool.

A double ranked set sample with $k = 3$ and $m = 10$ is extracted from this data
set, with $X^{(1)}$ as the ranking variable. This sample is given in Table 6.

Based on this sample, the determinant of the correlation matrix of the sample is
obtained as

$$|R| = 0.0077.$$

The simulation results under the multivariate normal distribution with variance-covariance
matrix $\Sigma$, satisfying $|\Sigma| = 0.01$, suggest using the bandwidth in (9) with
$d_1 = 0.60$. The resulting estimators are

$H_n((Y)) = 3.599$, $H_n((X^{(1)}, X^{(2)}, Y)) = 11.226$,

$H_n((X^{(1)}, X^{(2)})) = 7.807$, $H_n((X^{(1)}, X^{(3)}, Y)) = 10.464$,

$H_n((X^{(1)}, X^{(3)})) = 6.894$, $H_n((X^{(2)}, X^{(3)}, Y)) = 11.118$,

$H_n((X^{(2)}, X^{(3)})) = 7.679$, $\hat I((X^{(1)}, X^{(2)}, Y)) = 0.180$,

$\hat\lambda((X^{(1)}, X^{(2)}, Y)) = 0.302$, $\hat I((X^{(1)}, X^{(3)}, Y)) = 0.029$,

$\hat\lambda((X^{(1)}, X^{(3)}, Y)) = 0.057$, $\hat I((X^{(2)}, X^{(3)}, Y)) = 0.160$,

and

$\hat\lambda((X^{(2)}, X^{(3)}, Y)) = 0.274$.

Since

$$\hat\lambda((X^{(1)}, X^{(2)}, Y)) > \hat\lambda((X^{(2)}, X^{(3)}, Y)) > \hat\lambda((X^{(1)}, X^{(3)}, Y)),$$

the variables $X^{(1)}$ and $X^{(2)}$ might be chosen out of the three auxiliary variables,
based on the sample given in Table 6.
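
The arithmetic behind this selection can be retraced from the reported entropy estimates alone. The sketch below assumes that each mutual information value is obtained as the entropy of the auxiliary pair plus the entropy of $Y$ minus the joint entropy, and that the standardized measure is $1 - e^{-2I}$; the dictionary keys `X1`, `X2`, `X3` are illustrative labels for the three auxiliary variables.

```python
import math

h_y = 3.599  # reported entropy estimate of Y
h_pair = {("X1", "X2"): 7.807, ("X1", "X3"): 6.894, ("X2", "X3"): 7.679}
h_joint = {("X1", "X2"): 11.226, ("X1", "X3"): 10.464, ("X2", "X3"): 11.118}

results = {}
for pair in h_pair:
    # Mutual information between the auxiliary pair and Y.
    mi = h_pair[pair] + h_y - h_joint[pair]
    results[pair] = (mi, 1.0 - math.exp(-2.0 * mi))

# Pair with the largest standardized dependence on Y.
best = max(results, key=lambda pair: results[pair][1])
for pair, (mi, lam) in results.items():
    print(pair, round(mi, 3), round(lam, 3))
print("selected:", best)
```

The recomputed mutual information values match the reported ones to three decimals, and the standardized values agree up to rounding of the inputs, so the ranking of the three pairs is unchanged.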

Table 6: Double ranked set sample extracted from the body fat data set.

                                                   j
Variable        i    1        2        3        4        5        6        7        8        9        10
Y_[i]j          1    8.7      10.9     21.4     9.2      5.0      13.4     0.0      17.2     14.4     8.8
                2    25.8     25.8     15.5     20.9     25.9     7.5      20.5     25.5     6.5      25.9
                3    30.8     16.5     19.8     27.1     28.1     16.5     28.1     26.2     24.0     28.8
X^(1)_[i]j      1    98.6     99.2     101.6    109.3    95.0     86.0     95.0     91.6     98.0     82.9
                2    86.1     108.8    86.8     83.0     95.0     79.5     79.1     86.1     90.8     100.5
                3    104.3    74.6     78.2     96.4     92.1     88.1     89.1     85.3     97.5     79.4
X^(2)_[i]j      1    177.00   224.50   203.25   232.75   178.25   176.75   198.5    188.75   202.25   160.75
                2    166.75   226.75   159.25   173.25   198.50   148.50   168.0    166.75   172.75   189.75
                3    212.00   131.50   126.50   187.75   184.25   157.75   184.0    188.15   209.25   140.50
X^(3)_[i]j      1    104.0    113.2    110.0    117.5    99.7     97.3     106.5    99.1     109.2    93.6
                2    92.9     115.3    92.3     93.6     106.5    89.8     93.0     92.9     99.1     106.4
                3    106.6    88.6     88.8     101.3    98.9     97.5     100.8    96.6     107.6    91.2

Figure 1: Scatterplots for pairs of variables for the body fat data set


Appendix

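Proof of Lemma 1 The proof of part (i) is trivial. For the consistent ranked set
sample, we have

$$\begin{aligned}
E\int\!\!\int a(x,y)\, dU_n(x)\, dU_n(y) &= nE\int\!\!\int a(x,y)\, dF_n^{RSS}(x)\, dF_n^{RSS}(y) - n\int\!\!\int a(x,y)\, dF_X(x)\, dF_X(y) \\
&= \frac{1}{n} E\Bigg[\sum_{j=1}^{m}\Big\{\sum_{i=1}^{k} a(X_{[i]j}, X_{[i]j}) + \sum_{i_1 \ne i_2} a(X_{[i_1]j}, X_{[i_2]j})\Big\} + \sum_{j_1 \ne j_2}\sum_{i_1=1}^{k}\sum_{i_2=1}^{k} a(X_{[i_1]j_1}, X_{[i_2]j_2})\Bigg] \\
&\quad - n\int\!\!\int a(x,y)\, dF_X(x)\, dF_X(y) \\
&= \int a(x,x)\, dF_X(x) - nm^{-1}\int\!\!\int a(x,y)\, dF_X(x)\, dF_X(y) \\
&\quad + nm^{-1}\int\!\!\int a(x,y)\, \frac{1}{k^2}\sum_{i_1 \ne i_2} f_{X_{[i_1]}}(x) f_{X_{[i_2]}}(y)\, dx\, dy \\
&= \int a(x,x)\, dF_X(x) - \frac{1}{k}\sum_{i=1}^{k}\int\!\!\int a(x,y)\, f_{X_{[i]}}(x) f_{X_{[i]}}(y)\, dx\, dy.
\end{aligned}$$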

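Proof of Theorem 1 By the Taylor expansion of the integrands and using the fact
that

$$f_n^{RSS}(x) - A_\gamma(x) = n^{-1/2}\int \gamma^{-p} K_p\!\left(\frac{x - y}{\gamma}\right) dU_n(y),$$

we can write

$$\begin{aligned}
-H_n^{RSS}(X) &= \int_S \log(f_n^{RSS}(x))\, dF_n^{RSS}(x) \\
&= \int_S \big(\log(f_n^{RSS}(x)) - \log(A_\gamma(x))\big)\, dF_X(x) + n^{-1/2}\int_S \big(\log(f_n^{RSS}(x)) - \log(A_\gamma(x))\big)\, dU_n(x) \\
&\quad + \int_S \log(A_\gamma(x))\, dF_X(x) + n^{-1/2}\int_S \log(A_\gamma(x))\, dU_n(x) \\
&= \int_S \frac{f_n^{RSS}(x) - A_\gamma(x)}{A_\gamma(x)}\, dF_X(x) - \int_S \frac{(f_n^{RSS}(x) - A_\gamma(x))^2}{2A_\gamma(x)^2}\, dF_X(x) + n^{-1/2}\int_S \frac{f_n^{RSS}(x) - A_\gamma(x)}{A_\gamma(x)}\, dU_n(x) \\
&\quad + \int_S \log(A_\gamma(x))\, dF_X(x) + n^{-1/2}\int_S \log(A_\gamma(x))\, dU_n(x) + O(n^{-3/2}).
\end{aligned}$$

Thus

$$\begin{aligned}
H_n^{RSS}(X) &= H_\gamma - n^{-1/2}\left[\int_S\!\int \frac{\gamma^{-p} K_p\!\left(\frac{x-y}{\gamma}\right)}{A_\gamma(x)}\, dU_n(y)\, dF_X(x) + \int_S \log(A_\gamma(x))\, dU_n(x)\right] \\
&\quad - n^{-1}\left[\int_S\!\int \frac{\gamma^{-p} K_p\!\left(\frac{x-y}{\gamma}\right)}{A_\gamma(x)}\, dU_n(y)\, dU_n(x) - \int_S\!\int\!\!\int \frac{\gamma^{-2p} K_p\!\left(\frac{x-y}{\gamma}\right) K_p\!\left(\frac{x-z}{\gamma}\right)}{2A_\gamma(x)^2}\, dU_n(y)\, dU_n(z)\, dF_X(x)\right] \\
&\quad + O(n^{-3/2}).
\end{aligned}$$

By applying Lemma 1, the required result follows.

□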